Xinsheng Wang

Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens

Mar 03, 2025

Steering Language Model to Stable Speech Emotion Recognition via Contextual Perception and Chain of Thought

Feb 25, 2025

Audio-FLAN: A Preliminary Release

Feb 23, 2025

Llasa: Scaling Train-Time and Inference-Time Compute for Llama-based Speech Synthesis

Feb 06, 2025

CosyAudio: Improving Audio Generation with Confidence Scores and Synthetic Captions

Jan 28, 2025

EDSep: An Effective Diffusion-Based Method for Speech Source Separation

Jan 27, 2025

FleSpeech: Flexibly Controllable Speech Generation with Various Prompts

Jan 08, 2025

StreamVoice+: Evolving into End-to-end Streaming Zero-shot Voice Conversion

Aug 05, 2024

SCDNet: Self-supervised Learning Feature-based Speaker Change Detection

Jun 12, 2024

StreamVoice: Streamable Context-Aware Language Modeling for Real-time Zero-Shot Voice Conversion

Feb 07, 2024